Chi-square-based scoring function for categorization of MEDLINE citations
Authors
Abstract
OBJECTIVES Text categorization has been used in biomedical informatics to identify documents containing topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE citation covers a genetically relevant topic. METHODS Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain. RESULTS Validation was done on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87, with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). The differences were not statistically significant, and the results showed that our chi-square scoring performs as well as the compared machine-learning algorithms. CONCLUSIONS We suggest that chi-square scoring is an effective way to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process.
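The scoring idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so several details here are assumptions rather than the authors' implementation: each descriptor's weight is taken to be the Pearson chi-square statistic of a 2x2 table (corpus by descriptor presence), signed positive when the descriptor is relatively more frequent in the genetic corpus and negative otherwise, and a citation's score is the sum of the weights of its descriptors. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per MeSH descriptor (assumed scheme).

    Each document is a list of descriptor strings. The sign follows the
    positive-indicator rule from the abstract: positive when the relative
    frequency in the genetic corpus is greater than in the nongenetic one.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    if n_gen == 0 or n_non == 0:
        raise ValueError("both corpora must be non-empty")
    # Count documents containing each descriptor (presence, not repeats).
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Sum the signed weights of a citation's descriptors (assumed rule)."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora, invented purely for illustration.
genetic = [
    ["Mutation", "Genes", "Humans"],
    ["Genes", "Polymorphism, Genetic"],
    ["Mutation", "Humans"],
]
nongenetic = [
    ["Humans", "Surgery"],
    ["Surgery", "Nurses"],
    ["Humans", "Nurses"],
]

weights = descriptor_weights(genetic, nongenetic)
citations = {
    "citation A": ["Mutation", "Genes"],
    "citation B": ["Surgery", "Nurses"],
}
# Rank citations so the most "genetic" ones come first.
ranked = sorted(citations, key=lambda k: score_citation(citations[k], weights),
                reverse=True)
print(ranked)
```

In this sketch a descriptor that is equally frequent in both corpora (here "Humans") contributes nothing to the score, and descriptors unseen in either corpus are ignored. Turning scores into genetic/nongenetic labels would additionally need a cutoff, which the abstract does not specify.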
Similar resources
Side-chain modeling with an optimized scoring function.
Modeling side-chain conformations on a fixed protein backbone has a wide application in structure prediction and molecular design. Each effort in this field requires decisions about a rotamer set, scoring function, and search strategy. We have developed a new and simple scoring function, which operates on side-chain rotamers and consists of the following energy terms: contact surface, volume ov...
Chi Square Feature Extraction Based Svms Arabic Language Text Categorization System
This paper aims to implement a Support Vector Machines (SVMs) based text classification system for Arabic language articles. This classifier uses the CHI square method for feature selection in the pre-processing step of the text classification system design procedure. Compared to other classification methods, our system shows a high classification effectiveness for Arabic data set in term ...
A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive ...
Brazilian knowledge production in the field of child and adolescent health.
OBJECTIVES To assess (a) the trend of MEDLINE citation of pediatrics articles associated with Brazilian institutions from 1990 through 2004; (b) the number of Brazilian pediatrics articles published in journals with the highest impact factor; and (c) the regional distribution of institutions. METHODS PubMed search limited to ages 0 to 18 years, English language, MEDLINE and humans subsets, Br...
Identification of comment-on sentences in online biomedical documents using support vector machines
MEDLINE® is the premier bibliographic online database of the National Library of Medicine, containing approximately 14 million citations and abstracts from over 4,800 biomedical journals. This paper presents an automated method based on support vector machines to identify a “comment-on” list, which is a field in a MEDLINE citation denoting previously published articles commented on by a given a...
Journal:
- Methods of Information in Medicine
Volume 49, Issue 4
Pages: -
Publication date: 2010